White Wines Quality Data Analysis by Yijia Ma

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

This data set describes white wines. It contains 4,898 white wines, each with 11 variables quantifying its chemical properties (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) plus a quality score. The main question is: which chemical properties influence the quality of white wines? Some of these variables are likely to be correlated with one another.

Univariate Plots Section

## [1] 4898   12
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

From this plot, the quality distribution appears roughly normal, peaking at 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The range of fixed.acidity is (3.8, 14.2). Values are concentrated between 6.3 and 7.3, and the distribution appears approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The range of volatile.acidity is (0.08, 1.1). Values are concentrated between 0.21 and 0.32 (the first and third quartiles), and the distribution appears approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The range of citric.acid is (0, 1.66). Values are concentrated between 0.27 and 0.39, and the distribution appears approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The first plot shows that the distribution is right-skewed, so I plotted it on a log scale. The transformed residual sugar distribution appears bimodal, with one peak at around 2 and another at around 10.
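The effect of the log scale can be checked from the five-number summary alone. The sketch below is only an illustration (the names `sugar` and `tail_ratio` are mine, not part of the original analysis): it compares the length of the upper tail with the length of the lower tail, before and after a log10 transform.

```r
# Five-number summary of residual.sugar, copied from the output above.
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Ratio of upper-tail length to lower-tail length; 1 would be symmetric.
tail_ratio <- function(q) {
  unname((q["max"] - q["median"]) / (q["median"] - q["min"]))
}

raw_ratio <- tail_ratio(sugar)         # raw scale: long right tail
log_ratio <- tail_ratio(log10(sugar))  # log10 scale: much closer to 1

print(round(c(raw = raw_ratio, log10 = log_ratio), 2))
```

A ratio far above 1 on the raw scale and close to 1 on the log scale is consistent with the right skew described above, although it says nothing about the bimodality, which only the histogram can show.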

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The range of chlorides is (0.009, 0.346). Values are concentrated between 0.036 and 0.05, and the distribution appears approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The range of free.sulfur.dioxide is (2, 289). Values are concentrated between 23 and 46, and the distribution appears approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The range of total.sulfur.dioxide is (9, 440). Values are concentrated between 108 and 167, and the distribution appears approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The range of density is (0.9871, 1.039). Values are concentrated between 0.99 and 0.996.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The range of pH is (2.72, 3.82). Values are concentrated between 3 and 3.3, and the distribution appears approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The range of sulphates is (0.22, 1.08). Values are concentrated between 0.4 and 0.55.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The first plot shows a right-skewed distribution, so I applied a log transform, but the result is still slightly right-skewed.

The mode of alcohol is 9.4.
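R has no built-in function for the statistical mode, so it has to be derived from a frequency table. The helper below is a minimal sketch (`stat_mode` is a hypothetical name, not a base R function), demonstrated on the six alcohol values printed at the top of this report rather than on the full data.

```r
# Count occurrences with table() and return every value that attains the
# highest count, so ties are kept rather than silently dropped.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Alcohol values of the six rows shown by head() at the top of the report.
alcohol_head <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
print(stat_mode(alcohol_head))
```

On these six rows the result is a tie between two values. Assuming `df` is the data frame used in the `lm()` calls, `stat_mode(df$alcohol)` would be the corresponding call on the full data.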

Univariate Analysis

What is the structure of your dataset?

There are 4,898 white wines with 11 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and a quality score. All features are numeric. The median and the mode of quality are both 6.

What is/are the main feature(s) of interest in your dataset?

alcohol

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

fixed.acidity, citric.acid, free.sulfur.dioxide, pH and residual.sugar

Did you create any new variables from existing variables in the dataset?

No

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

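Yes. Residual sugar and alcohol were right-skewed, so I plotted both on a log scale. The log-transformed residual sugar distribution appeared bimodal. I did not otherwise tidy or change the data.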

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000

From a subset of the data, alcohol and density seem to have stronger correlations with quality than the other features, but residual sugar and total sulfur dioxide are themselves moderately correlated with alcohol and density. I want to look more closely at scatter plots involving quality and variables such as alcohol, density and residual sugar.

Comparing alcohol to quality, the first plot suffers from some overplotting. The moderate positive correlation seen in the earlier table (r ≈ 0.44) is easy to see here: wines with a high alcohol percentage are more often of high quality than wines with a low alcohol percentage.
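Overplotting arises here because quality takes only integer values, so many points land on the same horizontal line. The plotting code behind this report is not shown. The following is only a base R sketch of one common remedy (vertical jitter plus semi-transparent points), using the six rows printed at the top of the report as stand-in data.

```r
# Alcohol and quality for the six rows shown by head() above.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
quality <- c(6, 6, 6, 6, 6, 6)

set.seed(1)  # make the random jitter reproducible
# Spread the integer quality scores vertically by at most +/- 0.3.
quality_jittered <- jitter(quality, amount = 0.3)

# Semi-transparent points make dense regions appear darker.
plot(alcohol, quality_jittered,
     pch = 16, col = adjustcolor("black", alpha.f = 0.25),
     xlab = "alcohol", ylab = "quality (jittered)")
```

The jitter amount of 0.3 and the alpha of 0.25 are arbitrary choices. With all 4,898 wines a lower alpha would probably be needed.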

## 
## Call:
## lm(formula = df$quality ~ df$alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.582009   0.098008   26.34   <2e-16 ***
## df$alcohol  0.313469   0.009258   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16
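The fitted line can be used directly for prediction. The sketch below copies the two coefficients from the summary above (the helper name `predict_quality` is hypothetical) and checks one property of ordinary least squares: the line passes through the point of means, so the prediction at the mean alcohol of 10.51 should land close to the mean quality of 5.878.

```r
# Coefficients copied from the lm(quality ~ alcohol) summary above.
intercept <- 2.582009
slope     <- 0.313469

predict_quality <- function(alcohol) intercept + slope * alcohol

# Prediction at the mean alcohol value reported by summary().
at_mean <- predict_quality(10.51)
print(round(at_mean, 3))

# Predictions at the lowest and highest alcohol values in the data.
at_extremes <- predict_quality(c(8.0, 14.2))
print(round(at_extremes, 2))
```

Any small gap between the first printed value and 5.878 comes from using the rounded mean alcohol. The predictions at the extremes span only about two quality points, which is another way of seeing how much variation the model leaves unexplained.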

Comparing density to quality, the first plot suffers from some overplotting. The small negative correlation seen in the earlier table (r ≈ -0.31) is easy to see here: wines with low density are more often of high quality than wines with high density.

## 
## Call:
## lm(formula = df$quality ~ df$density)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1441 -0.6258  0.0005  0.5162  4.2102 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   96.277      4.003   24.05   <2e-16 ***
## df$density   -90.942      4.027  -22.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8429 on 4896 degrees of freedom
## Multiple R-squared:  0.09432,    Adjusted R-squared:  0.09414 
## F-statistic: 509.9 on 1 and 4896 DF,  p-value: < 2.2e-16

Comparing total.sulfur.dioxide to quality, the first plot suffers from some overplotting. Most white wines have a total.sulfur.dioxide between 100 and 200 (units not given), and the very weak correlation seen in the earlier table (r ≈ -0.17) is easy to see here.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Comparing chlorides to quality, the first plot suffers from some overplotting. Most white wines have chlorides between 0.036 and 0.05, and the small negative correlation seen in the earlier table (r ≈ -0.21) is easy to see here.

Comparing residual sugar to alcohol, the first plot suffers from some overplotting. The moderate negative correlation seen in the earlier table (r ≈ -0.45) is easy to see here.

Comparing density to alcohol, the strong negative correlation seen in the earlier table (r ≈ -0.78) is easy to see here.

Comparing total sulfur dioxide to alcohol, the moderate negative correlation seen in the earlier table (r ≈ -0.45) is easy to see here.

This plot shows a strong positive relationship between residual sugar and density (r ≈ 0.84).

Like residual sugar, total sulfur dioxide has a positive, though weaker, relationship with density (r ≈ 0.53).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

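Quality correlates more strongly with alcohol (r ≈ 0.44) and density (r ≈ -0.31) than with any other feature, although neither correlation is strong in absolute terms.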

As alcohol percentage increases, the variance in quality increases. In the plot of quality vs alcohol, there are horizontal bands where many white wines take on different alcohol values at the same quality score. The relationship between quality and alcohol appears to be exponential rather than linear.

Based on the R^2 value, alcohol only explains about 19 percent of the variance in quality. Other features of interest can be incorporated into the model to explain the variance in the quality.
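The R^2 figures can be cross-checked against the correlation matrix, because in a one-predictor linear model R^2 equals the squared Pearson correlation. The sketch below uses only numbers printed earlier in this report. The variable names are mine.

```r
# Correlations with quality, copied from the correlation matrix above.
r_alcohol <- 0.435574715
r_density <- -0.307123313
n <- 4898  # number of wines

# One-predictor R^2 is the squared correlation with the response.
r2_alcohol <- r_alcohol^2
r2_density <- r_density^2
print(round(c(alcohol = r2_alcohol, density = r2_density), 4))

# The F statistic of a one-predictor model follows from R^2 and n.
f_alcohol <- r2_alcohol / (1 - r2_alcohol) * (n - 2)
print(round(f_alcohol))
```

These values should agree with the Multiple R-squared and F-statistic lines in the two single-predictor `lm()` summaries above.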

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Alcohol tends to correlate with several of the other features. The more alcohol a wine contains, the less water and sugar it contains, so it makes sense that alcohol is negatively correlated with density (r ≈ -0.78) and residual sugar (r ≈ -0.45).

What was the strongest relationship you found?

The quality of a white wine is positively and moderately correlated with alcohol and negatively correlated with density. The variables chlorides and volatile acidity also correlate with quality, but less strongly than alcohol and density. Either alcohol or density could be used in a model to predict the quality of white wines. Ideally, however, the two should not be used together, since they are strongly correlated with each other (r ≈ -0.78).
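The alcohol-density correlation in the matrix above is about -0.78. One way to gauge how much that overlap would inflate coefficient uncertainty in a two-predictor model is the variance inflation factor (VIF). This is a sketch using the standard two-predictor formula, not something computed in the original analysis.

```r
# Correlation between alcohol and density, from the matrix above.
r_alcohol_density <- -0.78013762

# With exactly two predictors, both share the same VIF: 1 / (1 - r^2).
vif <- 1 / (1 - r_alcohol_density^2)
print(round(vif, 2))
```

A VIF of 1 would mean no overlap at all. Commonly quoted rules of thumb treat values above 5 or 10 as problematic, so the result here points to noticeable but not extreme collinearity.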

Multivariate Plots Section

From this plot, residual sugar has a clear negative relationship with alcohol, but there is no obvious relationship with quality.

## 
## Call:
## lm(formula = df$quality ~ df$alcohol + df$density)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5670 -0.5242 -0.0003  0.4881  3.0898 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -22.49170    6.16503  -3.648 0.000267 ***
## df$alcohol    0.36036    0.01478  24.389  < 2e-16 ***
## df$density   24.72842    6.07937   4.068 4.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.796 on 4895 degrees of freedom
## Multiple R-squared:  0.1925, Adjusted R-squared:  0.1921 
## F-statistic: 583.3 on 2 and 4895 DF,  p-value: < 2.2e-16
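The two-predictor fit can be reproduced from the three pairwise correlations alone, which also helps explain why the density coefficient is positive here although density on its own had a negative slope. The sketch below solves for standardized coefficients using the usual normal equations on the correlation scale. All names are mine, and the inputs are copied from the correlation matrix.

```r
# Pairwise correlations copied from the matrix above.
r_qa <- 0.435574715   # quality vs alcohol
r_qd <- -0.307123313  # quality vs density
r_ad <- -0.78013762   # alcohol vs density

# Predictor correlation matrix and predictor-response correlations.
Rxx <- matrix(c(1, r_ad, r_ad, 1), nrow = 2)
rxy <- c(r_qa, r_qd)

# Standardized coefficients solve Rxx %*% beta = rxy.
beta <- solve(Rxx, rxy)
names(beta) <- c("alcohol", "density")

# R^2 of the two-predictor model on the correlation scale.
r2 <- sum(beta * rxy)

print(round(beta, 3))
print(round(r2, 4))
```

The standardized density coefficient comes out small and positive: once alcohol is in the model, the negative raw association between density and quality is absorbed by alcohol. The R^2 should match the Multiple R-squared printed above.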

As with residual sugar, density is clearly related to alcohol, but there is no obvious relationship with quality.

## 
## Call:
## lm(formula = df$quality ~ df$alcohol + df$chlorides)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5575 -0.5179 -0.0143  0.4913  3.1295 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.861231   0.116382  24.585  < 2e-16 ***
## df$alcohol    0.297669   0.009906  30.051  < 2e-16 ***
## df$chlorides -2.470822   0.557945  -4.428  9.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7958 on 4895 degrees of freedom
## Multiple R-squared:  0.193,  Adjusted R-squared:  0.1926 
## F-statistic: 585.2 on 2 and 4895 DF,  p-value: < 2.2e-16

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I could not find any features that strengthen each other. There is an absence of strong correlations with quality.

Were there any interesting or surprising interactions between features?

Yes. Surprisingly, almost none of the features has a strong relationship with the quality of white wines.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model starting from the quality and the alcohol.

Alcohol alone accounts for 18.96% of the variance in the quality of white wines (adjusted R^2). Adding the density variable to the model improves this only slightly, to 19.2%.
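The adjusted R^2 values can be recomputed from the Multiple R-squared lines of the two summaries. This is a sketch of the standard formula. The helper name `adjusted_r2` is hypothetical, and because the inputs are the rounded printed values, the results can differ from R's output in the fourth decimal place.

```r
# Adjusted R^2 penalizes R^2 for the number of predictors p.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 4898  # number of wines

# Multiple R-squared values as printed in the two lm() summaries.
adj_alcohol         <- adjusted_r2(0.1897, n, p = 1)
adj_alcohol_density <- adjusted_r2(0.1925, n, p = 2)

print(round(c(adj_alcohol, adj_alcohol_density), 4))
```

With nearly 5,000 observations the penalty for one extra predictor is tiny, so the small gain from adding density reflects how little new information density carries, not the adjustment itself.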

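Limitation: the model explains little of the variance in quality. In addition, the density coefficient changes sign once alcohol is included (from -90.9 to 24.7), a symptom of the strong correlation between the two predictors. Strength: all the variables are statistically significant, which suggests that alcohol percentage and density are associated with the quality of white wines.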

Final Plots and Summary

Plot One

Description One

The distribution of white wine quality appears to be normal. The largest number of white wines have a quality of 6, the middle value.

Plot Two

Description Two

White wines with higher alcohol content tend to have higher quality.

Plot Three

Description Three

White wines with high alcohol content and low density tend to have higher quality.

Reflection

The white wines data set contains information on 4,898 white wines across 12 variables (11 chemical features plus quality). I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of white wines across many variables and created a linear model to predict white wine quality.

There was only a blurry trend between a wine's density or alcohol percentage and its quality. I was surprised that volatile acidity and citric acid did not have a strong positive correlation with quality.

Some limitations of this model: I struggled to increase the R^2 of the model. Without finding any further features strongly related to quality, my model can explain only 19.2 percent of the variance in quality.